Abstract

Redundant disk arrays provide a way to achieve rapid recovery
from media failures at a relatively low storage cost for large
scale database systems requiring high availability. In this paper
we propose a method for using redundant disk arrays to support
rapid recovery from system crashes and transaction aborts in
addition to their role in providing media failure recovery. A
twin page scheme is used to store the parity information in the
array so that the time for transaction commit processing is not
degraded. Using an analytical model, we show that the proposed
method achieves a significant increase in the throughput of
database systems using redundant disk arrays by reducing the
number of recovery operations needed to maintain the consistency
of the database.


1  Introduction 

In a database system, rapid recovery may be necessary for
restoring the database to a consistent state after a failure.
Several types of failures can occur. The most typical are
transaction aborts, which can be due to program errors or
deadlocks, or can be user initiated. When a transaction aborts,
the recovery manager has to restore all database pages modified
by the transaction to their previous state. The second type of
failure is a system crash. In this case system tables maintained
in main memory are lost. The recovery mechanism has to UNDO all
updates made to the database by transactions that were active
when the crash occurred and to REDO modifications performed by
completed transactions and not yet reflected in the database at
the time of the crash.

Another type of failure is media failure. One common way to deal
with this type of failure is by periodically generating archive
copies of the database and by logging updates to the database
performed by committed transactions between archive copies into a
redo log file. When a media failure occurs the database is
reconstructed from the last copy and the log file is used to
apply all updates performed by transactions that committed after
the last copy was generated. In such a case, a media failure
causes significant down time and the overhead for recovery is
quite high. For large systems, e.g., with over 50 disks, the mean
time to failure (MTTF) of the permanent storage subsystem can be
less than 25 days¹. Mirrored disks have been employed to provide
rapid media recovery [1]. However, disk mirroring incurs a 100%
storage overhead which is prohibitive for many applications.
Redundant Disk Array (RDA) organizations [2, 3] provide an
alternative for maintaining reliable storage. However, even when
disk mirroring or RDAs are used, archiving and redo logging may
still be necessary to protect the database against operator
errors or system software design errors.

In this paper, we present a technique that exploits the
redundancy in disk arrays to support recovery from transaction
and system failures in addition to providing fast media recovery.
This is achieved by using a twin page scheme for storing the
parity information, making it possible to keep the old version of
the parity along with the new version. The old version of the
parity is used to undo updates performed by aborted transactions
or by transactions interrupted by a system failure.

In Sections 2 and 3 we briefly review several techniques for
transaction recovery in database systems and discuss two RDA
organizations. In Section 4, we present our database recovery
scheme. The results of our performance analysis are detailed in
Section 5.

1  Assuming  an  MTTF  of  30,000  hours  for  each  disk. 

2 Recovery Techniques


Recovery algorithms typically use some form of logging or
shadowing. In the logging approach [4], before a new version
(after-image) of a record or page is written to the database, a
copy of the old version (before-image) is placed into a
sequential log file. If a transaction aborts or the system
crashes, the log file is analyzed and the state of the database
is restored. In the shadowing approach the update of a page is
placed into a new physical page on disk [5, 6]. The physical
pages containing the old versions are released after all updates
of the committing transaction have been written to disk. One
problem with the shadowing approach is dynamic mapping, since it
requires maintaining a very large page table which leads to high
I/O overhead during normal processing. Another problem is the
disk scrambling effect, which decreases the sequentiality of disk
accesses.

In describing and in analyzing our method, we will use the
following taxonomy of database recovery algorithms introduced by
Haerder and Reuter [7]. They classify recovery algorithms with
respect to the following four concepts:

Propagation² of updates. The propagation strategy can be ATOMIC,
in which case any set of updated pages can be propagated to the
database in one atomic action. In the ¬ATOMIC case, propagation
of updates can be interrupted by a system crash and database
pages are updated in place.

Page replacement. Two policies can be used: the STEAL policy
allows pages modified by uncommitted transactions to be
propagated to the database before end-of-transaction (EOT); the
opposite policy is referred to as ¬STEAL. No UNDO recovery is
necessary with a ¬STEAL policy.

EOT processing. Two categories exist: the FORCE discipline
requires all pages modified by a transaction to be propagated
before EOT; the opposite discipline is called ¬FORCE.

Checkpointing schemes. Checkpointing is used to propagate updates
to the database in order to minimize the number of REDO recovery
actions to be performed after a crash. In the Transaction
Oriented Checkpointing (TOC) scheme, a checkpoint is generated at
the end of each transaction. This is equivalent to using the
FORCE discipline in EOT processing. Two other types of
checkpoints can be used: Transaction Consistent Checkpoints (TCC)
are generated during quiescent periods where no transactions are
being processed; Action Consistent Checkpoints (ACC) are less
restrictive and require only that no update statements are
processed during checkpoint generation.

2 Propagation to the database means that the new version is
visible to higher level software. Updates can be written to disk
without being propagated (e.g., shadowing).

Figure 1: RAID with rotated parity on four disks.


3  Redundant  Disk  Arrays 


Striped disk arrays have been proposed and implemented for
increasing the transfer bandwidth in high performance I/O
subsystems [8, 9, 10]. In order to allow the use of a large
number of disks in such arrays without compromising the
reliability of the I/O subsystem, redundancy is sometimes
included in the form of parity information [3, 10]. Patterson et
al. [3] have presented several possible organizations for
Redundant Arrays of Inexpensive Disks (RAID). One interesting
organization is RAID with rotated parity, in which blocks of data
are interleaved across N disks while the parity of the N blocks
is written on the (N+1)st disk. The parity is rotated over the
set of disks in order to avoid contention on the parity disk.
Figure 1 shows the array organization with four disks. The
organization allows both large (full stripe) concurrent accesses
and small (individual disk) accesses. In this paper, we
concentrate on small read/write accesses. For a small write
access, the data block is read from the relevant disk and
modified. To compute the new parity, the old parity has to be
read, XORed with the new data and XORed with the old data. Then
the new data and new parity can be written back to the
corresponding disks. Stonebraker et al. [11] have advocated the
use of a RAID organization to provide high availability in
database systems.
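
As an illustration of the small-write parity update described above, the following minimal Python sketch computes the new parity from the old parity, the old data and the new data, and checks it against a full recomputation over the stripe. The block contents and the helper names (`xor_blocks`, `small_write_parity`) are hypothetical and chosen only for this example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity = old parity XOR old data XOR new data."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# A stripe with N = 3 data blocks (illustrative contents only).
d = [b"\x0f\x00", b"\xf0\x01", b"\x55\x02"]
parity = xor_blocks(xor_blocks(d[0], d[1]), d[2])

# Overwrite the second block and update the parity incrementally.
new_d1 = b"\xaa\xff"
new_parity = small_write_parity(parity, d[1], new_d1)

# Recomputing the parity from all data blocks gives the same result.
full_recompute = xor_blocks(xor_blocks(d[0], new_d1), d[2])
```

The incremental form touches only the modified data disk and the parity disk, which is why a small write does not need to read the other N-1 data blocks.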

Gray et al. [2] studied ways of using an architecture such as
RAID in on-line transaction processing (OLTP) systems. They found
that because of the nature of I/O requests in OLTP systems,
namely a large number of small accesses, it is not convenient to
have several disks servicing the same request. Hence, the
organization shown in Figure 2 was proposed. It is referred to as
parity striping. It consists of reserving an area for parity on
each disk and writing data sequentially on each disk without
interleaving. For a group of N+1 disks, each disk is divided into
N+1 areas; one of these areas on each disk is reserved for parity
and the other areas contain data. N data areas from N different
disks are grouped together in a parity group and their parity is
written on the parity area of the (N+1)st disk.

Figure 2: Parity striping of disk arrays.


4  RDA-Based  Recovery 

In the remainder of this paper, we consider an I/O subsystem that
is a collection of redundant disk arrays. The organization of the
arrays is either parity striping or data striping (RAID with
rotated parity). In the case of data striping we assume that a
large striping unit is used in order to ensure that I/O requests
will typically be serviced by a single data disk. We also make
the following assumptions: communication between main memory and
the I/O subsystem is performed using fixed size pages; database
pages are updated in place, which implies that propagation is
¬ATOMIC; a STEAL policy is used, thus allowing modified pages to
be propagated before EOT.

4.1  General  Description  of  the  Approach 

RDA-based recovery makes use of the parity information present in
the disk arrays to undo updates performed by aborted
transactions. However, the parity is not sufficient by itself to
undo all updates performed by an aborted transaction. Updates
that cannot be undone using the parity are dealt with using a log
file.

A page parity group is the set of pages that share the same
parity page. In the following, unless there is ambiguity, we will
use the term parity group to denote a page parity group. A parity
group can be in one of two states: clean or dirty. A parity group
is dirty when one of its data pages has been modified by a
transaction and the modified version has been written back to the
database before the transaction modifying it commits (using the
notation of Haerder and Reuter, the page has been stolen from the
buffer). Otherwise the parity group is called clean. Only one
modified data page per parity group can be written back to the
database by uncommitted transactions without UNDO logging. If
additional pages in the parity group have been modified and need
to be written back to the database then their before-images must
be logged first. A dirty parity group goes back to the clean
state when the transaction that caused it to become dirty
commits. A table in main memory contains the numbers of all
parity groups that are in the dirty state. It also contains the
number of the data page within the group that caused the group to
be in the dirty state and the number of the parity page holding
the updated parity. Only log N bits need to be used to store the
data page number and one bit for the parity page number. The
table is used to check whether a page updated by an active
transaction can be written back to disk without UNDO logging.

When a transaction updates a page, that page can be written back
to the database without UNDO logging if its parity group is clean
or if its parity group is dirty and the update is for the same
page that caused the group to move into the dirty state, i.e.,
the same page has been updated, stolen from the buffer, then
rereferenced by the same transaction, updated and stolen again
from the buffer before EOT³. Note that this does not affect the
degree of concurrency or interfere with the locking policy used
in the system. We do not specify when a transaction can or cannot
modify a page. We only specify when a modified page can be
written back to disk without UNDO logging.
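
The write-back rule can be sketched as a small in-memory table. This is only an illustration under assumptions: the class and method names are hypothetical, and the transaction identifier stored per entry is an addition of this sketch (the text lists only the group number, the data page number and the parity page number).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DirtyEntry:
    page: int         # data page within the group that made it dirty
    txn: int          # owning transaction (an assumption of this sketch)
    parity_twin: int  # 0 or 1: twin parity page holding the updated parity


class DirtyGroupTable:
    """In-memory table of parity groups currently in the dirty state."""

    def __init__(self) -> None:
        self._dirty: Dict[int, DirtyEntry] = {}

    def needs_undo_log(self, group: int, page: int, txn: int) -> bool:
        """True if the before-image must be logged before writing the page."""
        entry = self._dirty.get(group)
        if entry is None:
            return False  # clean group: the parity alone can undo the write
        # Dirty group: only the page that dirtied it may be rewritten freely.
        return not (entry.page == page and entry.txn == txn)

    def steal(self, group: int, page: int, txn: int, parity_twin: int) -> bool:
        """Register a write-back before EOT; return True if UNDO logging is needed."""
        if self.needs_undo_log(group, page, txn):
            return True
        self._dirty[group] = DirtyEntry(page, txn, parity_twin)
        return False

    def commit(self, txn: int) -> None:
        """Groups dirtied by the committing transaction become clean again."""
        for group in [g for g, e in self._dirty.items() if e.txn == txn]:
            del self._dirty[group]
```

For example, after `steal(7, 2, txn=1, parity_twin=0)` a second page of group 7 requires UNDO logging until transaction 1 commits, while page 2 itself can be stolen again by transaction 1 without logging.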

If a single parity page is used, then when a group becomes dirty
the old parity information has to be kept in the parity page to
be able to recover in case of a transaction failure. That would
mean that when the transaction commits, the new parity has to be
recomputed in order to update the parity page. That would require
reading all the data pages in the group in order to compute the
new parity. To avoid that problem a twin page scheme is used for
the parity pages. The basic mechanism of the twin page scheme is
as follows: one of the parity pages always contains the valid
parity of the group while the other page contains obsolete parity
information. When a data page is modified in a parity group, the
obsolete parity page (P for example) is updated with the new
parity of the array. If the transaction performing the update
commits then the modified parity page (P) becomes the valid
parity page; otherwise the other parity page (P') remains the
valid parity page and its contents are used to recover the data
page that was modified by the failed transaction. In order to
recover the old version of a data page after a transaction abort
it is sufficient to XOR the contents of both parity pages and the
new data page: $D_{old} = (P \oplus P') \oplus D_{new}$. When a
parity group is dirty because one of its data pages $D_i$ has
been stolen from the buffer and another page $D_j$ needs to be
written to disk, UNDO logging must be performed for $D_j$⁴; then
both parity pages P and P' need to be updated, since when the
group is dirty it is necessary to maintain a current parity page
reflecting the actual parity of the data on disk and an "old"
parity page that would be used to recover the uncommitted data
page $D_i$ in case of a transaction abort. In all cases, when
writing a data page to disk the corresponding parity page(s) must
be updated first.

3 Normally such an event should not occur often since buffer
management algorithms are not supposed to replace a page that
will be referenced again in the near future.
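
The XOR identity used for undoing an update can be checked on a toy parity group. The following sketch is illustrative only; the page contents and function names are hypothetical.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of any number of equal-length blocks."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_old_data(p_new: bytes, p_old: bytes, d_new: bytes) -> bytes:
    """D_old = (P xor P') xor D_new, using both twin parity pages."""
    return xor_blocks(p_new, p_old, d_new)


# Parity group with three data pages (illustrative contents only).
d = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
p_old = xor_blocks(*d)                 # valid parity before the update (P')

# An uncommitted transaction overwrites the second data page on disk;
# the obsolete twin (P) receives the new parity.
d_new = b"\xff\xee"
p_new = xor_blocks(p_old, d[1], d_new)

# On abort, the before-image is rebuilt without any UNDO log record.
restored = recover_old_data(p_new, p_old, d_new)
```

The identity holds because the two parity pages differ exactly by the XOR of the old and new versions of the single modified data page, which is why only one unlogged page per group can be supported.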

4.2  Twin  Page  Management 

The twin parity pages are stored on different disks. This is
necessary in order to be able to perform transaction recovery
following a disk failure. In order to identify which of the twin
parity pages contains the valid parity information, a timestamp
is stored in the page header. The page with the highest timestamp
contains the valid parity information. When an update is undone
after a transaction or system failure, the timestamp of the
current parity page is reset to 0. When a data page is updated
both parity pages are read and the one with the highest timestamp
is selected for modification. Then the parity is computed and the
modified parity page is written back to disk. In order to avoid
reading both parity pages, a bit map can be maintained in main
memory indicating which is the current parity page for each of
the parity groups in the database. However, such a bit map may
not survive a system crash. Hence, following a crash that
destroys the map, both parity pages will have to be read to
identify the current parity page and to reconstruct the bit map.
In this case, two bits would have to be used in the bit map for
each parity group to encode the three possible states: parity
page P is the current parity page, parity page P' is the current
parity page, or the information is not available and both pages
have to be read from disk. Following a system crash a background
process that runs during idle periods of the system can be
initiated to reconstruct the bit map.
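
A possible shape for the three-state bit map is sketched below. The names are hypothetical, the callback `read_timestamps` stands in for the two disk reads of the page headers, and resolving a timestamp tie in favor of P is an assumption of this sketch rather than something the text specifies.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class Current(Enum):
    P = 0         # parity page P is the current parity page
    P_PRIME = 1   # parity page P' is the current parity page
    UNKNOWN = 2   # not available: both pages must be read from disk


def resolve_current(
    group: int,
    bitmap: Dict[int, Current],
    read_timestamps: Callable[[int], Tuple[int, int]],
) -> Current:
    """Return the current parity page of a group, reading disk only if needed."""
    state = bitmap.get(group, Current.UNKNOWN)
    if state is Current.UNKNOWN:
        ts_p, ts_p_prime = read_timestamps(group)  # reads both page headers
        # The page with the highest timestamp holds the valid parity.
        state = Current.P if ts_p >= ts_p_prime else Current.P_PRIME
        bitmap[group] = state                      # rebuild the map lazily
    return state
```

After a crash every entry starts as `UNKNOWN`; each group costs two parity page reads the first time it is touched, and none afterwards.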

4.3  Recovery  from  System  Failure 

Following a system crash we need to identify which transactions
have to be backed out and which pages have been modified on disk
by those transactions. A Begin-Of-Transaction (BOT) record needs
to be written to a log file after the transaction begins and
before it writes back any modified pages to disk, and an EOT
record must be written to the log file when the transaction
commits. Modified database pages for which UNDO logging has been
performed can be recovered by reading their before-images from
the log. Modified database pages for which UNDO logging has not
been performed can be recovered using the parity pages. However,
information on which pages have been written to the database
without UNDO logging has to be saved in permanent storage. To
solve this problem, a technique similar to the one used in TWIST
[12] can be employed. In TWIST, a twin page scheme is used to
store all database pages, no before-image logging is performed
and the same problem of identifying which pages to undo after a
crash is encountered. The solution makes use of a log chain which
consists of pointers stored in the page headers that link
together pages modified by the same active transaction. In our
case, only modified pages written back to the database before EOT
without UNDO logging will be part of the log chain. The head of
the chain, though, has to be logged along with the transaction
id. I/O operations to maintain the log chain can be hidden behind
regular I/O requests and do not significantly affect system
performance.

4 The before-image of the page in the case of page logging or of
the modified record(s) in the case of record logging must be
written to a log file.


5  Performance  Analysis 

In order to evaluate the benefit of RDA recovery, we develop an
analytical model to evaluate transaction throughput for different
algorithms. Since the cost of maintaining parity information in a
system with redundant disk arrays is relatively high, we do not
advocate the use of RDAs solely for the purpose of supporting
transaction and crash recovery. We look at the benefit of using
RDA recovery in a system that already needs RDAs for the purpose
of rapid media recovery. We do this by comparing the throughput
of systems using traditional recovery algorithms and redundant
disk arrays to systems with the same recovery algorithms in
combination with RDA recovery. We consider both page and record
logging and in each case we examine two different recovery
algorithms and evaluate the improvement achieved by adding RDA
recovery to them. As far as storage is concerned, the extra cost
involved in using RDA recovery is that of the twin page scheme
for the parity, which is (100/N)% of the initial data storage
cost.

RDA recovery reduces the amount of UNDO logging and hence is
appropriate for systems using update-in-place, which implies
¬ATOMIC propagation, and a STEAL policy for page replacement. We
therefore restrict ourselves to the analysis of such algorithms.
Within this class of algorithms we examine both the FORCE and
¬FORCE strategies for EOT processing. For algorithms of the type
¬ATOMIC, STEAL, FORCE, only a TOC checkpointing policy makes
sense. For algorithms of the type ¬ATOMIC, STEAL, ¬FORCE, either
ACC or TCC checkpoints could be used; however, algorithms using
ACC checkpointing were shown to outperform those using the TCC
type⁵ [13]. Hence we only look at the former type of
checkpointing.

We use the same basic model as the one introduced by Reuter in
his evaluation of the performance of several database recovery
techniques [13]. We assume that the system is I/O bound and
therefore we look only at the number of I/O requests required to
perform a given operation. We also assume that the system is
running continuously with no periodic shutdown. This implies that
all cleanup activities required by the algorithm are accounted
for in the cost calculations instead of assuming they are
performed by some background process or during shutdown periods.

The workload considered consists of a set of P transactions
executing concurrently in the system. Transactions are of two
types: update or retrieval. The fraction of update transactions
is $f_u$. Each transaction accesses s database pages. The
fraction of accessed pages that are modified by an update
transaction is $p_u$. To characterize the behavior of the
database buffer, we use the communality C, which denotes the
probability that a page requested by an incoming transaction is
present in the buffer. It is assumed that the buffer is
sufficiently large so that once a transaction has referenced a
page, the page will remain in the buffer until it is no longer
needed by the transaction⁶.

The cost of recovery after a system crash is denoted by $c_s$ and
is measured by the number of page transfers between main memory
and the disk subsystem required to perform recovery. The cost of
executing a transaction is denoted by $c_t$. The transaction
throughput $r_t$ is defined as the number of transactions
processed during an availability interval. An availability
interval T is the period between two system crashes. Since all
cost measures are evaluated in terms of number of I/O operations,
we assume that the availability interval is measured in units of
page transfers.

5 Also, TCC checkpointing contradicts our assumption of a
continuously running system since it requires the establishment
of a quiescent point where no update transactions are present in
the system.

6 The page could still be replaced before the transaction commits
if a STEAL policy is used; however, if it is replaced it will not
be rereferenced by the transaction.

Let $c_r$ denote the cost of executing a retrieval transaction
and $c_u$ that of an update transaction. Then $c_t$ can be
obtained by: $c_t = (1 - f_u) c_r + f_u c_u$.

In the following, we derive the complete cost equations for
algorithms of the type ¬ATOMIC, STEAL, FORCE, TOC in the case of
page logging. The cost equations for the ¬ATOMIC, STEAL, ¬FORCE,
ACC type of algorithms and the equations for the case of record
logging have been omitted due to lack of space. These equations
and their derivations can be found in [14].

5.1  Probability  of  Logging 

We consider a set of K pages that have been modified by active
transactions and we compute the expected value of the size (X) of
the subset of pages that can be written back to the database
without UNDO logging. Let S be the total number of data pages in
the database. We assume that the K pages are randomly chosen from
the S pages in the database. Note that by using data striping
(RAID) with a large striping unit or parity striping, any
sequentiality in database accesses acts in favor of our scheme by
distributing the pages accessed over distinct parity groups.
Hence assuming that the pages are randomly distributed leads to
conservative results. The expected value of X is given by
$E[X] = \ldots$ and the probability of logging is obtained using
$p_l = 1 - E[X]/K$. The derivation of the expression for E[X] is
omitted and can be found in [14].
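
For readers who want a numerical feel for the probability of logging, E[X] can be estimated by simulation. The sketch below is not the closed-form expression of [14]; it assumes that page number i belongs to parity group i // N, that the K pages are distinct and drawn uniformly from the S data pages, and that X equals the number of distinct parity groups hit (one unlogged page per group). All names and parameter values are hypothetical.

```python
import random


def estimate_expected_x(S: int, N: int, K: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[X] under the random-placement assumption."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pages = rng.sample(range(S), K)          # K distinct modified pages
        groups = {page // N for page in pages}   # parity groups they fall in
        total += len(groups)                     # one unlogged page per group
    return total / trials


def logging_probability(S: int, N: int, K: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate of the probability of logging, 1 - E[X]/K."""
    return 1.0 - estimate_expected_x(S, N, K, trials, seed) / K


# Hypothetical configuration: 10,000 data pages, groups of 10, 200 dirty pages.
p_l_estimate = logging_probability(S=10_000, N=10, K=200)
```

With one data page per group (N = 1) no two pages can collide, so the estimate of the logging probability is exactly zero; it grows as K approaches the number of parity groups S/N.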

5.2 Algorithm of the Type ¬ATOMIC, STEAL, FORCE, TOC with Page
Logging

With the FORCE discipline, the checkpoint is taken at the end of
each transaction. The cost of checkpointing is therefore
accounted for in the cost of logging. Hence the throughput is
given by:

$r_t = (T - c_s) / c_t$.

Given our assumption that pages are not rereferenced by the
calling transaction after they have been replaced in the buffer,
the cost of writing and logging a page will be the same whether
the page is stolen from the buffer before transaction commit or
whether it stays in the buffer until EOT and is then logged and
written to the database. Hence we will account for all the costs
involved in logging the pages and writing them back to the
database as part of the cost of logging. We obtain the following
equations for $c_r$ and $c_u$:

$c_r = s(1 - C)$

$c_u = s(1 - C) + c_l + p_b c_b$

where $c_l$ is the cost of logging the transaction, $p_b$ is the
probability of a transaction abort and $c_b$ is the cost of
backing out the transaction in the case where an abort occurs.
The expression for $c_l$ is:

$c_l = 3 \times s p_u + 4 \times (2 s p_u) + 4 \times 4$

The first term is the cost of writing the pages back to the
database. Each write to the disk array costs three I/O operations
since, with the FORCE discipline, the old data is kept in the
buffer until EOT for the purpose of UNDO logging. The second term
is the cost of writing to the UNDO and REDO log files. REDO
information is needed only in the case where an operator error or
a system software error damages more than one disk in the disk
array. The log files are stored separately, which makes reading
the log to back out aborted transactions less costly. The last
term in the expression of $c_l$ is the cost of writing BOT and
EOT records to each of the log files.
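
The cost expressions without RDA recovery can be evaluated with a few lines of Python. This is a sketch under assumptions: the function names and the sample parameter values are hypothetical, and the abort probability and backout cost are passed in as plain numbers.

```python
def retrieval_cost(s: float, C: float) -> float:
    """Cost of a retrieval transaction: s(1 - C) page reads."""
    return s * (1 - C)


def logging_cost(s: float, p_u: float) -> float:
    """Cost of logging: 3*s*p_u + 4*(2*s*p_u) + 4*4."""
    write_back = 3 * s * p_u        # data page writes, 3 I/Os each
    log_pages = 4 * (2 * s * p_u)   # UNDO and REDO log pages, 4 I/Os each
    bot_eot = 4 * 4                 # BOT and EOT records on both log files
    return write_back + log_pages + bot_eot


def update_cost(s: float, C: float, p_u: float, p_b: float, c_b: float) -> float:
    """Cost of an update transaction: s(1 - C) + logging + p_b * backout."""
    return retrieval_cost(s, C) + logging_cost(s, p_u) + p_b * c_b


def transaction_cost(f_u: float, c_r: float, c_u: float) -> float:
    """Average cost per transaction: (1 - f_u)*c_r + f_u*c_u."""
    return (1 - f_u) * c_r + f_u * c_u


# Hypothetical workload: 10 pages per transaction, half of them modified.
c_l_example = logging_cost(s=10, p_u=0.5)
```

For these sample values the three terms are 15, 40 and 16 page transfers, so the log writes dominate the cost of logging.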

The probability of having to log a page with RDA recovery is
dependent on the number K of pages written back to the database
by incomplete transactions. We assume that when a transaction
writes back a page to the database before committing, the other
concurrent transactions are halfway through writing their own
modified pages. Therefore K is equal to half the total number of
pages modified by concurrent update transactions. Hence, in the
expression for the probability of logging obtained in Section
5.1, K must be replaced with⁷ $P s f_u p_u / 2$. With RDA
recovery, the formula for the cost of logging becomes:

$c'_l = (3 + 2 p_l) s p_u + 4(s p_u + s p_u p_l + 4) + 4(p_l - p_l^{s p_u})$

The major difference with $c_l$ is that UNDO logging has to be
performed only when the parity group is dirty, i.e., with
probability $p_l$. The term $2 p_l$ is added to 3 to account for
the fact that when writing to a dirty parity group both parity
pages need to be updated⁸. The last term in the expression of
$c'_l$ denotes the cost of writing the log chain header to the
log. The header is normally written along with the BOT record in
the same page except when the first page written by the
transaction to the database has to be logged and not all pages
updated by the transaction have to be logged.

7 Page logging implies the use of page locking and hence the sets
of pages modified by concurrent update transactions are disjoint.

8 We assume that log file pages and data pages do not belong to
the same parity groups.

To evaluate $c_b$ we assume that a transaction aborts in the
middle of processing its pages and that the other concurrent
update transactions have also logged half their modified pages.
The UNDO log has to be read up to the BOT record of the aborting
transaction.

$c_b = (p_u s / 2)(P f_u) + P f_u + 4(p_u s / 2) + 4$

The first term is the number of before-images that have to be
read from the log. The second term is the number of BOT/EOT
records to be read. The third term is the number of page
transfers to and from the database to undo the modifications
performed by the aborting transaction and the last term accounts
for the writing of a rollback record. With RDA recovery the above
formula becomes:

$c'_b = (p_u p_l s / 2) P f_u + (p_l - p_l^{s p_u}) P f_u + P f_u + (p_u s / 2)(6 p_l + 5(1 - p_l)) + 4$

In the first term, the number of logged before-images to
be read is now multiplied by p_l. The second term is the
expected number of log chain headers to be read from
the log. The other major difference is in the fourth
term. When recovering a page that has been logged, up
to six I/O operations might be necessary since its
parity group may still be dirty (see footnote 9).
On the other hand, if the page has been written to the
database without being logged, it is necessary to read
both parity pages in its parity group and the “new”
data page, then overwrite the database page with
the old data and change the state of the parity page
from working to invalid by resetting the timestamp in
its header. Hence five I/O operations are necessary
in the latter case.
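As a rough illustration, the baseline abort cost c_b and the per-page undo term of the RDA formula can be evaluated numerically. The sketch below is not part of the original model description: the function names are hypothetical, p_l is read here as the probability that a modified page has been logged, and the sample values (s = 10, p_u = 0.9, P = 6, f_u = 0.8) assume the high update frequency settings quoted in Section 5.3. The log chain header term of c'_b is deliberately left out.

```python
def abort_cost(s, p_u, P, f_u):
    """Baseline c_b (no RDA recovery), in page transfers.

    s   : pages accessed per transaction
    p_u : probability that an accessed page is updated
    P   : number of concurrent transactions
    f_u : fraction of update transactions
    """
    before_images = (p_u * s / 2) * (P * f_u)  # before-images read from the log
    bot_eot = P * f_u                          # BOT/EOT records read
    undo_io = 4 * (p_u * s / 2)                # transfers to and from the database
    rollback_record = 4                        # writing the rollback record
    return before_images + bot_eot + undo_io + rollback_record


def rda_undo_io_per_page(p_l):
    """Expected I/Os to undo one page with RDA recovery (upper bound).

    Six I/Os if the page was logged (assumed probability p_l),
    five if it was written to the database without being logged.
    """
    return 6 * p_l + 5 * (1 - p_l)


if __name__ == "__main__":
    print(abort_cost(s=10, p_u=0.9, P=6, f_u=0.8))
    print(rda_undo_io_per_page(p_l=0.5))  # p_l = 0.5 is an arbitrary sample value
```

Note that the per-page undo cost is higher with RDA recovery (between five and six I/Os instead of four); the benefit of the method comes from the reduced logging activity, not from cheaper undo operations.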

After a system crash, only UNDO recovery needs
to be performed. Hence the formula for c_s contains
the cost of reading the UNDO log file back to the BOT
record of the oldest transaction alive at the time of the
crash and then overwriting the modifications. The
work of the oldest transaction alive overlapped with
the work of some committed transactions; therefore, the
log records for half the work of about 2 P f_u
transactions need to be read. Hence the expressions for c_s
and c'_s are:

c_s = P f_u (s p_u + 2) + 4(P f_u p_u s/2)

9  Here  we  use  an  upper  bound  for  the  costs  involved  in  RDA 
recovery  in  order  to  keep  things  simple.  This  will  lead  to  a 
conservative  estimate  of  the  benefit  of  our  method. 
c'_s = P f_u (s p_u p_l + 2(p,-p,,p') + 2) +
P f_u (p_u s/2)(4 p_l + 5(1 - p_l)) + S/N.

The  term  S/N  is  an  upper  bound  for  the  cost  of  re¬ 
constructing  the  bit  map  for  the  current  parity  page. 
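The baseline crash recovery cost can be checked the same way. The following is a minimal sketch, assuming the reading c_s = P f_u (s p_u + 2) + 4(P f_u p_u s/2) of the formula above and the parameter values quoted in Section 5.3 (S = 5000, N = 10, P = 6); the function names are hypothetical and only the term S/N of the RDA variant is evaluated.

```python
def crash_cost(s, p_u, P, f_u):
    """Baseline c_s (no RDA recovery), in page transfers.

    About 2*P*f_u transactions each contribute half of their log
    records (s*p_u/2 before-images plus one BOT/EOT record); the
    P*f_u active update transactions then need 4 transfers for each
    of their p_u*s/2 modified pages.
    """
    log_read = P * f_u * (s * p_u + 2)
    undo_io = 4 * (P * f_u * p_u * s / 2)
    return log_read + undo_io


def bitmap_rebuild_bound(S, N):
    """Upper bound S/N on rebuilding the current-parity-page bit map."""
    return S / N


if __name__ == "__main__":
    # High update frequency environment (assumed values from Section 5.3)
    print(crash_cost(s=10, p_u=0.9, P=6, f_u=0.8))
    # High retrieval frequency environment
    print(crash_cost(s=40, p_u=0.3, P=6, f_u=0.1))
    print(bitmap_rebuild_bound(S=5000, N=10))
```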

5.3  Results 

We evaluate the algorithms in two different environments
depending on the frequency of update transactions.
The first four rows of Table 1 show the throughput
as a function of the communality C, both in a
system with high update frequency and in a system
with high retrieval frequency, for algorithms of type
¬ATOMIC, STEAL, FORCE, TOC with page logging.
As expected, the improvement in throughput
using RDA recovery is much more significant in the
high update frequency environment. For that
environment and for C = 0.9, the increase in throughput
is about 42%. The entries shown in the table are
actually values of r*/100. All the values for the
different parameters of the model, except for N, were
taken from [13]. These values are: S = 5000, N = 10,
P = 6, p_b = 0.01 and T = 5·10^6. For the high update
frequency environment, s = 10, f_u = 0.8 and
p_u = 0.9, while for the high retrieval frequency
environment, s = 40, f_u = 0.1 and p_u = 0.3.

The following four rows of Table 1 show the results
for both environments for an algorithm of type
ATOMIC, STEAL, ¬FORCE, ACC with page logging.
It can be seen that the improvement is not significant
in this case. However, the interesting result
is that, while without RDA recovery the ¬FORCE,
ACC type algorithm outperforms the FORCE, TOC
scheme, when RDA recovery is used the situation is
reversed and the latter algorithm outperforms the
former by a significant margin.

The last eight rows of Table 1 show the results in
the case of record logging. Unlike the page logging
case, the ¬FORCE, ACC scheme performs much better
than the FORCE, TOC scheme for the range of
values of C encountered in typical applications [15].
Also, for the ¬FORCE, ACC algorithm, the increase in
throughput achieved by using RDA recovery is higher
than for the same algorithm with page logging. This
is the case because, with record logging, the cost of
logging the updates of a stolen page is high relative
to the cost of logging non-stolen pages, and RDA
recovery reduces that cost by eliminating the need for
logging stolen pages in most cases. For example, for the
high update frequency environment and for C = 0.9,
the increase in throughput is about 14%. The benefit
of RDA recovery increases with the amount of work
performed by each transaction. Table 2 shows the
percent increase in throughput achieved by RDA
recovery combined with the ¬FORCE, ACC algorithm
as a function of the number of pages accessed by each
transaction (s) for the high update frequency
environment with C = 0.9.


6  Conclusions 

In this paper, we have presented a scheme that uses
redundant disk arrays to achieve rapid recovery from
media failures in database systems and simultaneously
provide support for recovery from transaction aborts
and system crashes. The redundancy present in the
array is exploited to allow a large fraction of pages
modified by active transactions to be written to disk
and updated in place without the need for undo
logging, thus reducing the number of recovery actions
performed by the recovery component. The method uses
a twin page scheme to store the parity information so
that it can be efficiently used in transaction undo
recovery. The extra storage used is about (100/N)% of
the size of the database, N being the number of disks
in the array.

We used a detailed analytical model to evaluate the
benefit of our scheme in a system equipped with
redundant disk arrays. We found that, in the case of page
logging, a FORCE, TOC algorithm combined with
RDA recovery significantly outperforms a FORCE,
TOC algorithm without RDA recovery as well as
¬FORCE, ACC type algorithms. In the case of
record logging, we found that a ¬FORCE, ACC
algorithm performs best and that the addition of RDA
recovery to it significantly improves its performance,
especially for transactions with a large number of
updated pages.